{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Air Quality Data Set\n",
    "\n",
    "In this notebook we will prepare and store the Air Quality Data Set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality)\n",
    "\n",
    "**Citation:**\n",
    "\n",
    "Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.\n",
    "\n",
    "\n",
    "## Download and unzip the data\n",
    "\n",
    "- Navigate to the [data folder](https://archive.ics.uci.edu/dataset/360/air+quality).\n",
    "- Download the zip file called **AirQualityUCI.zip**.\n",
    "- Unzip it.\n",
    "- Save the csv file called **AirQualityUCI.csv** into the **datasets** folder at the root of this repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If you downloaded and stored the file as explained\n",
    "# above, it should be located here:\n",
    "\n",
    "filename = '../Datasets/AirQualityUCI.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9357, 14)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the data\n",
    "\n",
    "data = pd.read_csv(\n",
    "    filename, sep=';', parse_dates=[['Date', 'Time']]\n",
    ").iloc[:, :-2]  # drops last 2 columns, not real variables\n",
    "\n",
    "# drop missing values\n",
    "# these are added at the end of the file during reading\n",
    "data.dropna(inplace=True)\n",
    "\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date_Time</th>\n",
       "      <th>CO(GT)</th>\n",
       "      <th>PT08.S1(CO)</th>\n",
       "      <th>NMHC(GT)</th>\n",
       "      <th>C6H6(GT)</th>\n",
       "      <th>PT08.S2(NMHC)</th>\n",
       "      <th>NOx(GT)</th>\n",
       "      <th>PT08.S3(NOx)</th>\n",
       "      <th>NO2(GT)</th>\n",
       "      <th>PT08.S4(NO2)</th>\n",
       "      <th>PT08.S5(O3)</th>\n",
       "      <th>T</th>\n",
       "      <th>RH</th>\n",
       "      <th>AH</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10/03/2004 18.00.00</td>\n",
       "      <td>2,6</td>\n",
       "      <td>1360.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>11,9</td>\n",
       "      <td>1046.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>1056.0</td>\n",
       "      <td>113.0</td>\n",
       "      <td>1692.0</td>\n",
       "      <td>1268.0</td>\n",
       "      <td>13,6</td>\n",
       "      <td>48,9</td>\n",
       "      <td>0,7578</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10/03/2004 19.00.00</td>\n",
       "      <td>2</td>\n",
       "      <td>1292.0</td>\n",
       "      <td>112.0</td>\n",
       "      <td>9,4</td>\n",
       "      <td>955.0</td>\n",
       "      <td>103.0</td>\n",
       "      <td>1174.0</td>\n",
       "      <td>92.0</td>\n",
       "      <td>1559.0</td>\n",
       "      <td>972.0</td>\n",
       "      <td>13,3</td>\n",
       "      <td>47,7</td>\n",
       "      <td>0,7255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10/03/2004 20.00.00</td>\n",
       "      <td>2,2</td>\n",
       "      <td>1402.0</td>\n",
       "      <td>88.0</td>\n",
       "      <td>9,0</td>\n",
       "      <td>939.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1140.0</td>\n",
       "      <td>114.0</td>\n",
       "      <td>1555.0</td>\n",
       "      <td>1074.0</td>\n",
       "      <td>11,9</td>\n",
       "      <td>54,0</td>\n",
       "      <td>0,7502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10/03/2004 21.00.00</td>\n",
       "      <td>2,2</td>\n",
       "      <td>1376.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>9,2</td>\n",
       "      <td>948.0</td>\n",
       "      <td>172.0</td>\n",
       "      <td>1092.0</td>\n",
       "      <td>122.0</td>\n",
       "      <td>1584.0</td>\n",
       "      <td>1203.0</td>\n",
       "      <td>11,0</td>\n",
       "      <td>60,0</td>\n",
       "      <td>0,7867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10/03/2004 22.00.00</td>\n",
       "      <td>1,6</td>\n",
       "      <td>1272.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>6,5</td>\n",
       "      <td>836.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1205.0</td>\n",
       "      <td>116.0</td>\n",
       "      <td>1490.0</td>\n",
       "      <td>1110.0</td>\n",
       "      <td>11,2</td>\n",
       "      <td>59,6</td>\n",
       "      <td>0,7888</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Date_Time CO(GT)  PT08.S1(CO)  NMHC(GT) C6H6(GT)  PT08.S2(NMHC)  \\\n",
       "0  10/03/2004 18.00.00    2,6       1360.0     150.0     11,9         1046.0   \n",
       "1  10/03/2004 19.00.00      2       1292.0     112.0      9,4          955.0   \n",
       "2  10/03/2004 20.00.00    2,2       1402.0      88.0      9,0          939.0   \n",
       "3  10/03/2004 21.00.00    2,2       1376.0      80.0      9,2          948.0   \n",
       "4  10/03/2004 22.00.00    1,6       1272.0      51.0      6,5          836.0   \n",
       "\n",
       "   NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)     T    RH  \\\n",
       "0    166.0        1056.0    113.0        1692.0       1268.0  13,6  48,9   \n",
       "1    103.0        1174.0     92.0        1559.0        972.0  13,3  47,7   \n",
       "2    131.0        1140.0    114.0        1555.0       1074.0  11,9  54,0   \n",
       "3    172.0        1092.0    122.0        1584.0       1203.0  11,0  60,0   \n",
       "4    131.0        1205.0    116.0        1490.0       1110.0  11,2  59,6   \n",
       "\n",
       "       AH  \n",
       "0  0,7578  \n",
       "1  0,7255  \n",
       "2  0,7502  \n",
       "3  0,7867  \n",
       "4  0,7888  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Attribute Information:\n",
    "\n",
    "Taken from the [original website](https://archive.ics.uci.edu/ml/datasets/Air+Quality).\n",
    "\n",
    "- 0 Date (DD/MM/YYYY)\n",
    "- 1 Time (HH.MM.SS)\n",
    "\n",
    "The above were merged during loading into the Date_Time column\n",
    "\n",
    "\n",
    "- 2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)\n",
    "- 3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)\n",
    "- 4 True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)\n",
    "- 5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)\n",
    "- 6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)\n",
    "- 7 True hourly averaged NOx concentration in ppb (reference analyzer)\n",
    "- 8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)\n",
    "- 9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)\n",
    "- 10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)\n",
    "- 11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)\n",
    "- 12 Temperature in Â°C\n",
    "- 13 Relative Humidity (%)\n",
    "- 14 AH Absolute Humidity "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# I will give the variables simpler names\n",
    "# more details at the end of the notebook\n",
    "\n",
    "new_var_names = [\n",
    "    'Date_Time',\n",
    "    'CO_true',\n",
    "    'CO_sensor',\n",
    "    'NMHC_true',\n",
    "    'C6H6_true',\n",
    "    'NMHC_sensor',\n",
    "    'NOX_true',\n",
    "    'NOX_sensor',\n",
    "    'NO2_true',\n",
    "    'NO2_sensor',\n",
    "    'O3_sensor',\n",
    "    'T',\n",
    "    'RH',\n",
    "    'AH',    \n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Date_Time', 'CO_true', 'CO_sensor', 'NMHC_true', 'C6H6_true',\n",
       "       'NMHC_sensor', 'NOX_true', 'NOX_sensor', 'NO2_true', 'NO2_sensor',\n",
       "       'O3_sensor', 'T', 'RH', 'AH'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.columns = new_var_names\n",
    "\n",
    "data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['CO_true', 'CO_sensor', 'NMHC_true', 'C6H6_true', 'NMHC_sensor',\n",
       "       'NOX_true', 'NOX_sensor', 'NO2_true', 'NO2_sensor', 'O3_sensor', 'T',\n",
       "       'RH', 'AH'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's capture the variables\n",
    "\n",
    "predictors = data.columns[1:]\n",
    "\n",
    "predictors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date_Time</th>\n",
       "      <th>CO_true</th>\n",
       "      <th>CO_sensor</th>\n",
       "      <th>NMHC_true</th>\n",
       "      <th>C6H6_true</th>\n",
       "      <th>NMHC_sensor</th>\n",
       "      <th>NOX_true</th>\n",
       "      <th>NOX_sensor</th>\n",
       "      <th>NO2_true</th>\n",
       "      <th>NO2_sensor</th>\n",
       "      <th>O3_sensor</th>\n",
       "      <th>T</th>\n",
       "      <th>RH</th>\n",
       "      <th>AH</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10/03/2004 18.00.00</td>\n",
       "      <td>2.6</td>\n",
       "      <td>1360.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>11.9</td>\n",
       "      <td>1046.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>1056.0</td>\n",
       "      <td>113.0</td>\n",
       "      <td>1692.0</td>\n",
       "      <td>1268.0</td>\n",
       "      <td>13.6</td>\n",
       "      <td>48.9</td>\n",
       "      <td>0.7578</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10/03/2004 19.00.00</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1292.0</td>\n",
       "      <td>112.0</td>\n",
       "      <td>9.4</td>\n",
       "      <td>955.0</td>\n",
       "      <td>103.0</td>\n",
       "      <td>1174.0</td>\n",
       "      <td>92.0</td>\n",
       "      <td>1559.0</td>\n",
       "      <td>972.0</td>\n",
       "      <td>13.3</td>\n",
       "      <td>47.7</td>\n",
       "      <td>0.7255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10/03/2004 20.00.00</td>\n",
       "      <td>2.2</td>\n",
       "      <td>1402.0</td>\n",
       "      <td>88.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>939.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1140.0</td>\n",
       "      <td>114.0</td>\n",
       "      <td>1555.0</td>\n",
       "      <td>1074.0</td>\n",
       "      <td>11.9</td>\n",
       "      <td>54.0</td>\n",
       "      <td>0.7502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10/03/2004 21.00.00</td>\n",
       "      <td>2.2</td>\n",
       "      <td>1376.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>9.2</td>\n",
       "      <td>948.0</td>\n",
       "      <td>172.0</td>\n",
       "      <td>1092.0</td>\n",
       "      <td>122.0</td>\n",
       "      <td>1584.0</td>\n",
       "      <td>1203.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>60.0</td>\n",
       "      <td>0.7867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10/03/2004 22.00.00</td>\n",
       "      <td>1.6</td>\n",
       "      <td>1272.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>836.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1205.0</td>\n",
       "      <td>116.0</td>\n",
       "      <td>1490.0</td>\n",
       "      <td>1110.0</td>\n",
       "      <td>11.2</td>\n",
       "      <td>59.6</td>\n",
       "      <td>0.7888</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Date_Time  CO_true  CO_sensor  NMHC_true  C6H6_true  NMHC_sensor  \\\n",
       "0  10/03/2004 18.00.00      2.6     1360.0      150.0       11.9       1046.0   \n",
       "1  10/03/2004 19.00.00      2.0     1292.0      112.0        9.4        955.0   \n",
       "2  10/03/2004 20.00.00      2.2     1402.0       88.0        9.0        939.0   \n",
       "3  10/03/2004 21.00.00      2.2     1376.0       80.0        9.2        948.0   \n",
       "4  10/03/2004 22.00.00      1.6     1272.0       51.0        6.5        836.0   \n",
       "\n",
       "   NOX_true  NOX_sensor  NO2_true  NO2_sensor  O3_sensor     T    RH      AH  \n",
       "0     166.0      1056.0     113.0      1692.0     1268.0  13.6  48.9  0.7578  \n",
       "1     103.0      1174.0      92.0      1559.0      972.0  13.3  47.7  0.7255  \n",
       "2     131.0      1140.0     114.0      1555.0     1074.0  11.9  54.0  0.7502  \n",
       "3     172.0      1092.0     122.0      1584.0     1203.0  11.0  60.0  0.7867  \n",
       "4     131.0      1205.0     116.0      1490.0     1110.0  11.2  59.6  0.7888  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# cast variables as numeric (they are strings by defo)\n",
    "# need to replace the , by . to cast as numeric\n",
    "\n",
    "for var in predictors:\n",
    "    if data[var].dtype =='O':\n",
    "        data[var] = data[var].str.replace(',', '.')\n",
    "        data[var] = pd.to_numeric(data[var])\n",
    "\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date_Time</th>\n",
       "      <th>CO_true</th>\n",
       "      <th>CO_sensor</th>\n",
       "      <th>NMHC_true</th>\n",
       "      <th>C6H6_true</th>\n",
       "      <th>NMHC_sensor</th>\n",
       "      <th>NOX_true</th>\n",
       "      <th>NOX_sensor</th>\n",
       "      <th>NO2_true</th>\n",
       "      <th>NO2_sensor</th>\n",
       "      <th>O3_sensor</th>\n",
       "      <th>T</th>\n",
       "      <th>RH</th>\n",
       "      <th>AH</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [Date_Time, CO_true, CO_sensor, NMHC_true, C6H6_true, NMHC_sensor, NOX_true, NOX_sensor, NO2_true, NO2_sensor, O3_sensor, T, RH, AH]\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[data['Date_Time'].apply(lambda x: len(x))>19]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date_Time</th>\n",
       "      <th>CO_true</th>\n",
       "      <th>CO_sensor</th>\n",
       "      <th>NMHC_true</th>\n",
       "      <th>C6H6_true</th>\n",
       "      <th>NMHC_sensor</th>\n",
       "      <th>NOX_true</th>\n",
       "      <th>NOX_sensor</th>\n",
       "      <th>NO2_true</th>\n",
       "      <th>NO2_sensor</th>\n",
       "      <th>O3_sensor</th>\n",
       "      <th>T</th>\n",
       "      <th>RH</th>\n",
       "      <th>AH</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2004-10-03 18:00:00</td>\n",
       "      <td>2.6</td>\n",
       "      <td>1360.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>11.9</td>\n",
       "      <td>1046.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>1056.0</td>\n",
       "      <td>113.0</td>\n",
       "      <td>1692.0</td>\n",
       "      <td>1268.0</td>\n",
       "      <td>13.6</td>\n",
       "      <td>48.9</td>\n",
       "      <td>0.7578</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2004-10-03 19:00:00</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1292.0</td>\n",
       "      <td>112.0</td>\n",
       "      <td>9.4</td>\n",
       "      <td>955.0</td>\n",
       "      <td>103.0</td>\n",
       "      <td>1174.0</td>\n",
       "      <td>92.0</td>\n",
       "      <td>1559.0</td>\n",
       "      <td>972.0</td>\n",
       "      <td>13.3</td>\n",
       "      <td>47.7</td>\n",
       "      <td>0.7255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2004-10-03 20:00:00</td>\n",
       "      <td>2.2</td>\n",
       "      <td>1402.0</td>\n",
       "      <td>88.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>939.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1140.0</td>\n",
       "      <td>114.0</td>\n",
       "      <td>1555.0</td>\n",
       "      <td>1074.0</td>\n",
       "      <td>11.9</td>\n",
       "      <td>54.0</td>\n",
       "      <td>0.7502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2004-10-03 21:00:00</td>\n",
       "      <td>2.2</td>\n",
       "      <td>1376.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>9.2</td>\n",
       "      <td>948.0</td>\n",
       "      <td>172.0</td>\n",
       "      <td>1092.0</td>\n",
       "      <td>122.0</td>\n",
       "      <td>1584.0</td>\n",
       "      <td>1203.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>60.0</td>\n",
       "      <td>0.7867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2004-10-03 22:00:00</td>\n",
       "      <td>1.6</td>\n",
       "      <td>1272.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>836.0</td>\n",
       "      <td>131.0</td>\n",
       "      <td>1205.0</td>\n",
       "      <td>116.0</td>\n",
       "      <td>1490.0</td>\n",
       "      <td>1110.0</td>\n",
       "      <td>11.2</td>\n",
       "      <td>59.6</td>\n",
       "      <td>0.7888</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Date_Time  CO_true  CO_sensor  NMHC_true  C6H6_true  NMHC_sensor  \\\n",
       "0 2004-10-03 18:00:00      2.6     1360.0      150.0       11.9       1046.0   \n",
       "1 2004-10-03 19:00:00      2.0     1292.0      112.0        9.4        955.0   \n",
       "2 2004-10-03 20:00:00      2.2     1402.0       88.0        9.0        939.0   \n",
       "3 2004-10-03 21:00:00      2.2     1376.0       80.0        9.2        948.0   \n",
       "4 2004-10-03 22:00:00      1.6     1272.0       51.0        6.5        836.0   \n",
       "\n",
       "   NOX_true  NOX_sensor  NO2_true  NO2_sensor  O3_sensor     T    RH      AH  \n",
       "0     166.0      1056.0     113.0      1692.0     1268.0  13.6  48.9  0.7578  \n",
       "1     103.0      1174.0      92.0      1559.0      972.0  13.3  47.7  0.7255  \n",
       "2     131.0      1140.0     114.0      1555.0     1074.0  11.9  54.0  0.7502  \n",
       "3     172.0      1092.0     122.0      1584.0     1203.0  11.0  60.0  0.7867  \n",
       "4     131.0      1205.0     116.0      1490.0     1110.0  11.2  59.6  0.7888  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# cast date and time variable as datetime\n",
    "# replace . by : to transform to datetime format\n",
    "\n",
    "data['Date_Time'] = data['Date_Time'].str.replace('.', ':', regex=False)\n",
    "\n",
    "data['Date_Time'] = pd.to_datetime(data['Date_Time'])\n",
    "# use dayfirst=True parameter if format is dd/mm/yyyy HH:mm:ss Eg: pd.to_datetime(data['Date_Time'], dayfirst=True)\n",
    "\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort index\n",
    "# we want the data in time order\n",
    "\n",
    "data.sort_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date_Time      datetime64[ns]\n",
       "CO_true               float64\n",
       "CO_sensor             float64\n",
       "NMHC_true             float64\n",
       "C6H6_true             float64\n",
       "NMHC_sensor           float64\n",
       "NOX_true              float64\n",
       "NOX_sensor            float64\n",
       "NO2_true              float64\n",
       "NO2_sensor            float64\n",
       "O3_sensor             float64\n",
       "T                     float64\n",
       "RH                    float64\n",
       "AH                    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the format\n",
    "\n",
    "data.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sanity check: duplicates in dt variable\n",
    "\n",
    "data['Date_Time'].duplicated().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date_Time      0\n",
       "CO_true        0\n",
       "CO_sensor      0\n",
       "NMHC_true      0\n",
       "C6H6_true      0\n",
       "NMHC_sensor    0\n",
       "NOX_true       0\n",
       "NOX_sensor     0\n",
       "NO2_true       0\n",
       "NO2_sensor     0\n",
       "O3_sensor      0\n",
       "T              0\n",
       "RH             0\n",
       "AH             0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check NA\n",
    "\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "min   2004-01-04 00:00:00\n",
       "max   2005-12-03 23:00:00\n",
       "Name: Date_Time, dtype: datetime64[ns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check time span\n",
    "\n",
    "data['Date_Time'].agg(['min', 'max'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save preprocessed data\n",
    "\n",
    "data.to_csv('../Datasets/AirQualityUCI_ready.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data set Summary\n",
    "\n",
    "The dataset was collected between January 2004 and March 2005.\n",
    "\n",
    "It consists of hourly measurements of the different air pollutants, NO2, NOX, CO, C6H6, O3 and NMHC. The measurements are accompanied by local temperature and humidity values, also recorded hourly.\n",
    "\n",
    "In the data collection experiments, scientists were testing new pollutant sensors. The values from the new sensors are stored in the variables called _sensors. \n",
    "\n",
    "For comparison, data for the pollutants was also gathered from fixed stations, that regularly measure the concentration of these gases. Those values are stored in the variables called _true."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
